@nhanford
Contributor

Hello @soumagne,

This PR adds a variant that builds and persistently installs the performance tests for Mercury.
Please let me know in which versions these CMake flags were introduced.

Preliminary results from a system similar to El Capitan at LLNL (200Gbps/HSA):

na_bw_get -c ofi -p cxi -b -l 1000
...
# [859097.930837] mercury->op [warning] /var/tmp/nhanford/spack-stage/spack-stage-mercury-master-ks4eadvpsbbqrwjxzxkebaxqymczvpry/spack-src/src/na/na_ofi.c:6950 na_ofi_cq_readerr() fi_cq_readerr() got err: 5 (Input/output error), prov_errno: 18 (ENTRY_NOT_FOUND)
# [859097.930842] mercury->op [error] /var/tmp/nhanford/spack-stage/spack-stage-mercury-master-ks4eadvpsbbqrwjxzxkebaxqymczvpry/spack-src/src/na/na_ofi.c:7137 na_ofi_cq_process_error() error event on operation ID 0x555555667670 (NA_CB_GET), fi_readmsg(iov_count=1, desc[0]=0x555555608680, msg_iov[0].iov_base=0x5555976f1000, msg_iov[0].iov_len=16384, addr=1, rma_iov_count=1, rma_iov[0].addr=0x3e000000, rma_iov[0].len=16384, rma_iov[0].key=0x900006b999e70000, context=0x5555556678b0, data=0) failed, rc: 5 (Input/output error)
16384                     15224.32                    1.03
32768                     22036.48                    1.42
65536                     22591.44                    2.77
131072                    22581.05                    5.54
262144                    23008.03                   10.87
524288                    23060.44                   21.68
1048576                   23117.53                   43.26
2097152                   23139.96                   86.43
4194304                   23152.97                  172.76
8388608                   23159.87                  345.43
16777216                  23163.12                  690.75

Thanks,
Nate
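
As a sanity check on the table above, the timing column can be recomputed from the other two. This is a minimal sketch that assumes the three columns are message size in bytes, bandwidth in MB/s with 1 MB = 2^20 bytes, and time in microseconds; the column headers are elided in the output above, so that reading is an assumption. It also assumes that 200 Gb/s is the nominal per-NIC line rate. The helper names are hypothetical.

```python
# Rows copied from the na_bw_get output above: (size, bandwidth, time).
# Assumed units: bytes, MB/s (1 MB = 2**20 bytes), microseconds.
ROWS = [
    (16384, 15224.32, 1.03),
    (32768, 22036.48, 1.42),
    (65536, 22591.44, 2.77),
    (131072, 22581.05, 5.54),
    (262144, 23008.03, 10.87),
    (524288, 23060.44, 21.68),
    (1048576, 23117.53, 43.26),
    (2097152, 23139.96, 86.43),
    (4194304, 23152.97, 172.76),
    (8388608, 23159.87, 345.43),
    (16777216, 23163.12, 690.75),
]

MIB = 2**20
LINE_RATE_BYTES_PER_S = 200e9 / 8  # assumption: nominal 200 Gb/s link


def expected_time_us(size_bytes, bandwidth_mib_s):
    """Time to move size_bytes at the given bandwidth, in microseconds."""
    return size_bytes / (bandwidth_mib_s * MIB) * 1e6


def line_rate_fraction(bandwidth_mib_s):
    """Fraction of the assumed line rate reached at this bandwidth."""
    return bandwidth_mib_s * MIB / LINE_RATE_BYTES_PER_S


if __name__ == "__main__":
    for size, bw, reported in ROWS:
        print(f"{size:>9} {expected_time_us(size, bw):8.2f} {reported:8.2f}")
    peak = max(bw for _, bw, _ in ROWS)
    print(f"peak / line rate = {line_rate_fraction(peak):.3f}")
```

Under those assumptions the recomputed times agree with the reported ones to within rounding, and the largest transfers reach roughly 97% of the assumed line rate.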

alalazo previously approved these changes Jan 7, 2026
@alalazo alalazo self-assigned this Jan 7, 2026
@soumagne
Contributor

soumagne commented Jan 9, 2026

@nhanford Apologies for the slow response. BUILD_TESTING_PERF was added in 2.3.0. You can add it next to that:

        if "@2.3.0:" in spec:
            cmake_args.append(define("BUILD_TESTING_UNIT", self.run_tests))

BUILD_TESTING has always been there as far as I recall. For some reason I had the impression that this was already supported in the spack recipe...

@soumagne
Contributor

soumagne commented Jan 9, 2026

Also, unrelated to that: @nhanford, the error you're getting when running that perf test with cxi is something we've been investigating. Could you please let me know whether you have seen that issue frequently when running that particular benchmark? Thanks.

@nhanford
Contributor Author

nhanford commented Jan 9, 2026

@soumagne Thanks for the info. This version should be a lot cleaner. Unfortunately I was not able to test it on my system because the build failed for me due to some CMake issues, but this should work.
Yes, I observed that failure persistently for smaller message sizes across many tests. The system under test is identical to LLNL El Capitan (Cray EX, Slingshot, AMD CPU/APU) except that it uses Slingshot Host Software 13 and CXI VNIs set using Flux.


        if "@2.3.0:" in spec:
            cmake_args.append(define("BUILD_TESTING_UNIT", self.run_tests))
            cmake_args.append(define("BUILD_TESTING_PERF", self.run_tests))
Contributor

I think we could add a perf variant instead? The idea behind having separate variables was that, in most cases, users want to install the perf utilities but do not want to bother building the whole test suite.

Contributor

so having instead something like:

Suggested change
-            cmake_args.append(define("BUILD_TESTING_PERF", self.run_tests))
+            cmake_args.append(define_from_variant("BUILD_TESTING_PERF", "perf"))

Contributor Author

I have changed over to this approach. However, I am unable to get the perf tests to build or install this way...

Contributor

Ah right, let me try to tidy this up for you.

Contributor

@nhanford please apply the following patch to your branch:

diff --git a/repos/spack_repo/builtin/packages/mercury/package.py b/repos/spack_repo/builtin/packages/mercury/package.py
index 1afa12b13b..29fb0db8b6 100644
--- a/repos/spack_repo/builtin/packages/mercury/package.py
+++ b/repos/spack_repo/builtin/packages/mercury/package.py
@@ -91,10 +91,12 @@ class Mercury(CMakePackage):
         spec = self.spec
         define = self.define
         define_from_variant = self.define_from_variant
+        build_tests = self.run_tests or self.spec.satisfies("@2.3.0:+perf")
         parallel_tests = "+mpi" in spec and self.run_tests
 
         cmake_args = [
             define_from_variant("BUILD_SHARED_LIBS", "shared"),
+            define("BUILD_TESTING", build_tests),
             define("MERCURY_USE_BOOST_PP", True),
             define_from_variant("MERCURY_USE_CHECKSUMS", "checksum"),
             define("MERCURY_USE_SYSTEM_MCHECKSUM", False),
@@ -102,7 +104,6 @@ class Mercury(CMakePackage):
             define_from_variant("NA_USE_BMI", "bmi"),
             define_from_variant("NA_USE_MPI", "mpi"),
             define_from_variant("NA_USE_SM", "sm"),
-            define("BUILD_TESTING", self.run_tests),
         ]
 
         if "@2.3.0:" in spec:

Contributor Author

That works! I think we're ready to merge after checks pass!

Contributor

Nice, thanks again for adding that!

@soumagne
Contributor

soumagne commented Jan 9, 2026

Thanks, this is very useful information. If you are able to reproduce this error consistently, could you please run with the environment variables HG_LOG_LEVEL=warn and HG_LOG_SUBSYS=hg,na,libfabric set, and file an issue on Mercury's GitHub? We can follow up there. Thanks!
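
A minimal sketch of such a run, driven from Python. The benchmark arguments are copied from the PR description and the two variables from the comment above; `debug_env` and the log file name are hypothetical, and the sketch assumes `na_bw_get` is on `PATH` and that any peer it needs is already set up.

```python
import os
import shutil
import subprocess

# Command line copied from the PR description.
COMMAND = ["na_bw_get", "-c", "ofi", "-p", "cxi", "-b", "-l", "1000"]


def debug_env(base=None):
    """Return a copy of the environment with the requested logging variables."""
    env = dict(os.environ if base is None else base)
    env["HG_LOG_LEVEL"] = "warn"
    env["HG_LOG_SUBSYS"] = "hg,na,libfabric"
    return env


if __name__ == "__main__":
    if shutil.which(COMMAND[0]) is None:
        print(f"{COMMAND[0]} not found on PATH; nothing to run")
    else:
        result = subprocess.run(
            COMMAND, env=debug_env(), capture_output=True, text=True
        )
        # Keep both streams so the log can be attached to an issue.
        with open("na_bw_get.log", "w") as log:
            log.write(result.stdout)
            log.write(result.stderr)
```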

@nhanford
Contributor Author

Since these errors appear not to affect performance and only seem to appear on our test system, I'm going to try to run this down locally and see how far I can get, and then I will file a bug if the issue persists. Thanks!

@soumagne
Contributor

Thanks! That will be much appreciated, as we've been trying to narrow this down...

@soumagne soumagne requested a review from alalazo January 15, 2026 16:15